{smcl}
{* 16aug2016}{...}
{cmd:help textreg_train}{right: ({browse "http://www.stata-journal.com/":SJxx-x: dm00xx})}
{hline}

{title:Title}

{p2colset 5 20 22 2}
{p2col :{hi:textreg_train} {hline 2} Predicting Citations based on Paper Titles in Stata}{p_end}
{p2colreset}{...}


{title:Syntax}

{p 8 17 2}
{cmd:textreg_train} {it:y} {it:X} [{cmd:using} {it:filename}] [{cmd:,} {it:options}]

{synoptset 30 tabbed}{...}
{synopthdr}
{synoptline}

{syntab:Options}
{synopt :{opth m:odel(strings:string)}} specifies whether OLS or logit regression is used. {p_end}
{synopt :{opth r:egularization(strings:string)}} specifies the form of regularization. {p_end}
{synopt :{opth regu_range:(strings:string)}} specifies the range of values of the regularization parameter. {p_end}
{synopt :{opt s:eed(integer)}} specifies the random seed. {p_end}
{synopt :{opt tf:idf}} reweights the document-n-gram matrix by term-frequency-inverse-document-frequency. {p_end}
{synopt :{opt st:em}} stems the words before estimation. {p_end}
{synopt :{opth stem_lang:(strings:string)}} specifies the language of the stemmer. {p_end}
{synopt :{opth stop:words(strings:string)}} specifies which stopwords to exclude. {p_end}
{synopt :{opt n:grams(integer)}} specifies the highest order of n-grams to be used. {p_end}
{synopt :{opt min_freq(integer)}} specifies the minimum number of documents in which a word must appear. {p_end}
{synopt :{opt max_freq(real)}} specifies the maximum share of documents in which a word may appear. {p_end}
{synopt :{opt max_voc(integer)}} specifies the maximum size of the vocabulary. {p_end}


{synoptline}

{pstd}See {help textreg_train##Options:{it:Options}} for details on specifying options.

 
{title:Description} 

{p 4 4 2} {cmd:textreg_train} trains a text regression model for a dependent variable (y) using the text stored in a string variable (X).


{p 4 4 2} {cmd:textreg_train} generates a new file containing the trained text regression model. 

{marker Options}{...}


{title:Options}

{phang} {cmd:model(}{it:string}{cmd:)} specifies whether an OLS or a logit regression is
used to train the model. Specify "reg" for OLS regression or "logit" for logit regression.
The default is OLS.

{phang} {cmd:regularization(}{it:string}{cmd:)} specifies which form of regularization
is used. The options are "lasso", "ridge", and "elasticnet". The default is
"ridge".

{phang} {cmd:regu_range(}{it:string}{cmd:)} specifies the range of values of the regularization
parameter to be tested in the 10-fold cross-validation. The string should give the start
and the end value separated by a comma. The default is {cmd:regu_range("0.1,1")}, so that
values between 0.1 and 1 in steps of 0.1 are used. In least-squares regression, larger values
imply stronger regularization. In logistic regression, smaller values imply stronger
regularization.

{phang} {cmd:seed(}{it:integer}{cmd:)} specifies the seed for the random number generator.

{phang} {cmd:tfidf} requests term-frequency-inverse-document-frequency (tf-idf) weighting.
If specified, the document-n-gram matrix is reweighted using tf-idf before the text
regression model is trained.

{phang} {cmd:stem} requests that the words be stemmed before the text regression model
is estimated. Stemming reduces words to their morphological roots
(e.g., "walked" to "walk"). The stemming implementation relies on Python's NLTK package.

{phang} {cmd:stem_lang(}{it:string}{cmd:)} specifies the language of the text strings.
For a list of supported languages, see {browse "https://www.nltk.org/_modules/nltk/stem/snowball.html"}.

{phang} {cmd:stopwords(}{it:string}{cmd:)} specifies a list of stopwords (very common words)
to exclude from the text regression (e.g., "I", "me").

{phang} {cmd:ngrams(}{it:integer}{cmd:)} specifies the highest order of n-grams to
include in the text regression. For example, specifying 2 uses unigrams and bigrams;
specifying 3 additionally uses trigrams. By default, only unigrams are used.

{phang} {cmd:min_freq(}{it:integer}{cmd:)}  allows the removal of words that appear 
in few documents.  Words that appear in fewer documents than
{cmd:min_freq(}{it:integer}{cmd:)} will be excluded from the text regression.  
The default is {cmd:min_freq(0)}.

{phang} {cmd:max_freq(}{it:real}{cmd:)} allows the removal of words that appear
frequently in documents.  Words that appear in a share of more than
{cmd:max_freq(}{it:real}{cmd:)} of the documents will be excluded from the text regression. The
default is {cmd:max_freq(1)}.

{phang} {cmd:max_voc(}{it:integer}{cmd:)} specifies the maximum number of
n-grams to be included in the text regression.


{title:Remarks}

{pstd} To run {cmd:textreg_train}, the user needs to specify the outcome variable (y) and the variable containing the text strings (X) for the text regression model. The options adjust how the text regression is trained.

{p 4 4 2} {cmd:textreg_train} saves the trained text regression model in a new file at the path and filename specified in {cmd:using}. If {cmd:using} is not specified, a file called "textreg_model.pkl" is created in the current working directory.




{title:Examples}

{pstd}To train a text regression model:{p_end}

{p 4 8 2}{cmd:. textreg_train} y text {p_end}

{p 4 8 2}{cmd:. textreg_train} y text using "$path/Models/textreg_model.pkl", model("reg") regularization("ridge") regu_range("1,10") ngrams(2) seed(1502) tfidf stem stem_lang("english") stopwords("I me he she it we you us") min_freq(10) max_freq(0.3) max_voc(100000) {p_end}
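
{pstd}As a further sketch, a lasso-regularized logit model could be trained in the same way. The variable name {cmd:highcite} (a binary outcome) and the filename "textreg_logit.pkl" are hypothetical and chosen only for illustration; every option used is one documented in this help file:{p_end}

{p 4 8 2}{cmd:. textreg_train} highcite text using "textreg_logit.pkl", model("logit") regularization("lasso") regu_range("0.1,1") seed(1502) tfidf {p_end}

{pstd}Because smaller values imply stronger regularization in logistic regression, the lower end of {cmd:regu_range()} corresponds to the most heavily penalized model in this case.{p_end}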




{title:Authors}

{pstd}Carlo Schwarz{p_end}
{pstd}Bocconi University{p_end}
{pstd}Italy{p_end}
{pstd}{browse "www.carloschwarz.eu"}{p_end}
{pstd}carlo.schwarz@unibocconi.it{p_end}



{title:Also see}  

{p 4 14 2}
Article:  {it:Stata Journal}, volume x, number x: {browse "http://www.stata-journal.com/":dm00xx}

